Esteban Montenegro-Montenegro, PhD
Psychology and Child Development
We will study the concept of the p-value and significance testing.
I will introduce the concept of hypothesis testing.
I will also introduce the mean comparison for independent groups.
p-value stands for probability value.
It was born as a measure for deciding whether to reject a hypothesis.
In statistics and science, we always have hypotheses in mind, and statistics translates those hypotheses into evaluations we can carry out.
For example, we often ask ourselves questions such as: why won’t my boyfriend express his feelings? And then we ask, is this related to gender? Is it true that women share their emotions more easily than men? If so, does it happen only to me? Is this a coincidence?
We can create a hypothesis from these questions; let’s try to write one:
\(H_{1}\) = There is a difference in emotional expression between cisgender women and cisgender men.
\(H_{0}\) = There is no difference in emotional expression between cisgender women and cisgender men.
That was easy, you might think, but why do we need a null statement?
\(H_{0}\) = The mean difference in emotional expression between cisgender women and cisgender men is equal to zero.
In the previous hypothesis we know that we are focusing on the mean difference; it is more specific.
The very first step to test our null hypothesis is to create a null model.
A null model is a model in which there is no difference between groups, or no relationship between variables.
In this example I will build a null model for the correlation between rumination and depression.
To create our null model we will resample and shuffle our observed data. This is similar to having two decks of cards: you shuffle the cards multiple times until it is hard to guess which card will come next, and every card has an equal probability of being selected.
This procedure is called permutation, and it will help us create a distribution of null correlations. In other words, all the correlations produced by the null model are produced by chance alone.
library(dplyr) ## provides %>%, mutate(), and across()
rum <- read.csv("ruminationComplete.csv", na.strings = "99") ## Imports the data into R
rum_scores <- rum %>% mutate(rumination = rowSums(across(CRSQ1:CRSQ13)), ## I'm calculating
                             depression = rowSums(across(CDI1:CDI26)))   ## total scores
corr <- cor(rum_scores$rumination, rum_scores$depression,
            use = "pairwise.complete.obs") ## Correlation between rumination and depression
### Let's create a distribution of null correlations
nsim <- 100000
cor.c <- vector(mode = "numeric", length = nsim)
for(i in 1:nsim){
  depre <- sample(rum_scores$depression, 212, replace = TRUE) ## 212 = sample size
  rumia <- sample(rum_scores$rumination, 212, replace = TRUE)
  cor.c[i] <- cor(depre, rumia, use = "pairwise.complete.obs")
}
hist(cor.c, breaks = 120,
     xlim = c(min(cor.c), 0.70),
     main = "Histogram of null correlations")
abline(v = corr, col = "darkblue", lwd = 2, lty = 1)
abline(v = c(quantile(cor.c, .025), quantile(cor.c, .975)),
       col = "red",
       lty = 2,
       lwd = 2)
Let’s estimate the probability of seeing \(r = 0.58\) according to our null model.
The probability is a number close to \(0.00\).
We now conclude that a correlation as extreme as \(r = 0.58\) is not explained by chance alone.
Note:
The (admittedly ugly) rule of thumb is to consider a p-value \(\leq .05\) as evidence of a small probability.
We can also do the same for the difference between means.
In this example, my null model is a model with null differences.
I again use permutations in this example.
We will estimate whether the difference in rumination by sex is explained by chance.
In this sample the mean difference in rumination between males and females is \(\Delta M\) = 2.74.
set.seed(1236)
ob.diff <- mean(rum_scores$rumination[rum_scores$sex == 0], na.rm = TRUE) -
  mean(rum_scores$rumination[rum_scores$sex == 1], na.rm = TRUE)
### let's create a distribution of null differences
nsim <- 100000
diff.c <- vector(mode = "numeric", length = nsim)
### This is something called a "loop"; you don't have to pay attention to the details.
for(i in 1:nsim){
  women <- sample(rum_scores$rumination[rum_scores$sex == 0],
                  length(rum_scores$rumination[rum_scores$sex == 0]),
                  replace = TRUE)
  men <- sample(rum_scores$rumination[rum_scores$sex == 1],
                length(rum_scores$rumination[rum_scores$sex == 1]),
                replace = TRUE)
  diff.c[i] <- mean(women, na.rm = TRUE) - mean(men, na.rm = TRUE)
}
hist(diff.c, breaks = 120,
     #xlim = c(min(diff.c), 0.70),
     main = "Histogram of Null Differences")
abline(v = ob.diff, col = "darkblue", lwd = 2, lty = 1)
abline(v = c(quantile(diff.c, .025), quantile(diff.c, .975)),
       col = "red",
       lty = 2,
       lwd = 2)
In real life, we don’t have to estimate a null model “by hand” as I did before.
R and JAMOVI will help us on that because the null model is already programmed.
In addition, when we compare independent means, we don’t usually use permutations. Instead, we rely on something called the \(t\)-distribution; let’s study this distribution in more detail.
The \(t\)-distribution was used to develop the test known as Student’s \(t\)-test.
In this test we use the \(t\)-distribution as our model to calculate the probability of observing a value as extreme as 2.74.
This probability is estimated using the Cumulative Distribution Function (CDF) of the \(t\)-distribution.
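For instance, R’s pt() function gives the CDF of the \(t\)-distribution, and a two-sided p-value is twice the tail area beyond the observed value. A small sketch (the t and df values here are only illustrative choices, not derived from our data yet):

```r
## Two-sided p-value from the CDF of the t-distribution.
## t_obs and df are illustrative values.
t_obs <- 2.25
df <- 203
p_two_sided <- 2 * pt(-abs(t_obs), df = df)
p_two_sided  ## around 0.025
```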
Student’s test is also known as the “t-test”.
In this test, we will transform the mean difference of both groups into a \(t\) value.
In this transformation, \(n_{1}\) is the sample size for group 1, \(n_{2}\) is the sample size for group 2, \(s^2\) denotes a variance, and \(\bar{X}\) represents a mean.
This formula will help us transform the observed difference in means into a value that comes from the \(t\)-distribution.
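In symbols, one common equal-variance form of this transformation (writing \(s^2_{p}\) for the pooled variance; a sketch consistent with the symbols defined above, not necessarily the exact formula from the original slides) is:

\[\begin{equation} t = \frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{s^2_{p}\Big(\frac{1}{n_{1}}+\frac{1}{n_{2}}\Big)}}, \qquad s^2_{p} = \frac{(n_{1}-1)s^2_{1}+(n_{2}-1)s^2_{2}}{n_{1}+n_{2}-2} \end{equation}\]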
We can see how useful a \(t\)-test is by presenting an applied example.
In this example we will try to reject the null hypothesis that says:
“The rumination score in males is equal to the rumination score in females”
Also in this example I’m introducing a new function in R named t.test(). This function will help us know whether we can reject the null hypothesis.
The function t.test() requires a formula created by using tilde ~.
In R, the variable on the right side of ~ is the independent variable, and the variable on the left side of ~ is the dependent variable.
In an independent-samples \(t\)-test the independent variable is always the grouping variable, and the dependent variable is always a continuous variable.
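The output below comes from running t.test() on our rumination data. Here is a self-contained sketch of a call with the same shape, using simulated stand-in data (the real analysis would pass rumination ~ sex and data = rum_scores; the group sizes 100 and 105 are my assumptions, chosen only so the degrees of freedom match):

```r
set.seed(42)
## Stand-in data: 100 participants with sex = 0 and 105 with sex = 1
dat <- data.frame(
  rumination = c(rnorm(100, mean = 31, sd = 8.5), rnorm(105, mean = 28.5, sd = 8)),
  sex = c(rep(0, 100), rep(1, 105))
)
## Formula interface: dependent variable ~ grouping variable
res <- t.test(rumination ~ sex, data = dat, var.equal = TRUE)
res$parameter  ## df = n1 + n2 - 2 = 203
```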
Two Sample t-test
data: rumination by sex
t = 2.2457, df = 203, p-value = 0.0258
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.3347849 5.1535481
sample estimates:
mean in group 0 mean in group 1
31.20896 28.46479
**“If we repeat the same analysis with multiple samples, the probability of finding a \(t\)-value = 2.25 is p = 0.03, under the assumptions of the null model.”**
We haven’t talked about the assumptions of the t-test model.
Remember that all models have assumptions; we have to assume something to study Nature.
The \(t\)-test model assumes that the difference in means is generated by a normally distributed process.
It also assumes that the variances are equal in both groups.
Let’s see what happens when we assume equal variances but the data do not come from a process with equal variances:
set.seed(1234)
N1 <- 50     ## Sample size group 1
N2 <- 50     ## Sample size group 2
Mean1 <- 100 ## Mean group 1
Mean2 <- 20  ## Mean group 2
results <- list()
for(i in 1:10000){
  group1 <- rnorm(N1, mean = Mean1, sd = 100) ### standard deviations are NOT equal
  group2 <- rnorm(N2, mean = Mean2, sd = 200) ### (100 vs. 200)
  dataSim <- data.frame(genValues = c(group1, group2),
                        groupVar = c(rep(1, N1), rep(0, N2)))
  results[[i]] <- t.test(genValues ~ groupVar, data = dataSim, var.equal = TRUE)$p.value
}
### Proportion of times we rejected the null hypothesis
cat("Proportion of times we rejected the null hypothesis",
    sum(unlist(results) <= .05)/length(results)*100)
Proportion of times we rejected the null hypothesis 70.5
We successfully rejected the null hypothesis in only 70.5% of the generated data sets. Given how large the true difference between the means is, the \(t\)-test should reject the null hypothesis nearly 100% of the time.
Let’s check what happens when we assume equal variances in our \(t\)-test and the data-generating process actually has equal variances:
set.seed(1234)
N1 <- 50
N2 <- 50
Mean1 <- 100
Mean2 <- 20
results_2 <- list()
for(i in 1:10000){
  group1 <- rnorm(N1, mean = Mean1, sd = 5) ## standard deviations are equal
  group2 <- rnorm(N2, mean = Mean2, sd = 5) ## (5 vs. 5)
  dataSim <- data.frame(genValues = c(group1, group2),
                        groupVar = c(rep(1, N1), rep(0, N2)))
  results_2[[i]] <- t.test(genValues ~ groupVar, data = dataSim, var.equal = TRUE)$p.value
}
### Proportion of times we rejected the null hypothesis
cat("Proportion of times we rejected the null hypothesis",
    sum(unlist(results_2) <= .05)/length(results_2)*100)
Proportion of times we rejected the null hypothesis 100
This time we reject the null hypothesis 100% of the time. This is what we were looking for! Remember that we generated data from a process where group 1 had a mean of 100 and group 2 had a mean of 20. The \(t\)-test should reject the null hypothesis every time we generate a new data set, but this didn’t happen when I made the wrong assumption: I assumed equal variances when I shouldn’t have.
Summary: when we wrongly assume that the variances are equal between groups, we decrease the probability of rejecting the null hypothesis when it should be rejected. This is bad!
These simulations showed the relevance of respecting the assumptions of the \(t\)-test.
There are tests to evaluate the assumption of equal variances between groups.
The most widely used is Levene’s Test for Homogeneity of Variance.
We can implement this test in R using the function leveneTest(), which comes with the R package car. You might need to install this package if you don’t have it on your computer; you can run this line of code to install it: install.packages("car")
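The call for our data would look like leveneTest(rumination ~ as.factor(sex), data = rum_scores). To demystify what the function does, here is a base-R sketch of the median-centered version of the test: an ANOVA on the absolute deviations from each group’s median (levene_sketch is my own illustrative helper, not part of the car package):

```r
## Levene's test (median-centered, i.e. Brown-Forsythe) under the hood:
## an ANOVA on |y - group median|.
levene_sketch <- function(y, g) {
  g <- factor(g)
  dev <- abs(y - ave(y, g, FUN = median))  ## distance from own group's median
  anova(lm(dev ~ g))[["Pr(>F)"]][1]        ## p-value for equality of spreads
}

set.seed(11)
y_unequal <- c(rnorm(200, sd = 1), rnorm(200, sd = 3))  ## clearly unequal spreads
grp <- rep(c(0, 1), each = 200)
levene_sketch(y_unequal, grp)  ## small p-value: unequal variances detected
```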
I’m going to test if the variance of rumination holds the assumption of equality of variance by sex:
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.2541 0.6148
203
In this test the null hypothesis is “The variances of group 1 and group 2 are equal”; if the \(p\)-value is less than or equal to 0.05, we reject the null hypothesis. In the output above you can see the p-value under the column Pr(>F).
In the case of rumination, the \(p\)-value = 0.61. Given that the \(p\)-value is higher than 0.05, we don’t reject the null hypothesis: we can assume the variances of rumination by sex are equal.
What happens if the Levene’s Test rejects the null hypothesis of homogeneity of variance?
Can we continue using the \(t\)-test to evaluate our hypothesis?
The answer is yes: you can still run a \(t\)-test, but with a correction to the degrees of freedom. We will talk more about degrees of freedom in the next sessions.
If you cannot assume equality of variances, all you have to do in R is to switch the argument var.equal = TRUE to var.equal = FALSE.
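In code, that single argument change is all there is to it. A self-contained sketch with simulated stand-in data (the real call would be t.test(rumination ~ sex, data = rum_scores, var.equal = FALSE); the groups here are invented with deliberately unequal spreads):

```r
set.seed(5)
## Stand-in data with unequal standard deviations (2 vs. 6)
dat <- data.frame(score = c(rnorm(60, mean = 10, sd = 2), rnorm(60, mean = 12, sd = 6)),
                  group = rep(c(0, 1), each = 60))
res <- t.test(score ~ group, data = dat, var.equal = FALSE)
res$method     ## "Welch Two Sample t-test"
res$parameter  ## non-integer df: the Welch-Satterthwaite correction at work
```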
Welch Two Sample t-test
data: rumination by sex
t = 2.2841, df = 149.72, p-value = 0.02377
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.3702483 5.1180847
sample estimates:
mean in group 0 mean in group 1
31.20896 28.46479
Up to this point we have studied how to test our hypothesis when comparing independent means, but we still have to answer a few questions: how large is a large difference between means? How small is a small difference? In fact, what counts as a small difference?
These questions were answered by Jacob Cohen (1923-1998).
Cohen created a standardized measure to quantify the magnitude of the difference between means.
Let’s see a pretty plot about it in this link.
Cohen’s \(d\) Effect Size:
\[\begin{equation} ES = \frac{\bar{X}_{1}-\bar{X}_{2}}{\sqrt{\frac{s^2_{1}+s^2_{2}}{2}}} \end{equation}\]

where \(\bar{X}_{1}\) and \(\bar{X}_{2}\) are the group means and \(s^2_{1}\) and \(s^2_{2}\) are the group variances.

Let’s compute Cohen’s \(d\) in R. In this example we will return to our \(t\)-test example of rumination by sex.
Welch Two Sample t-test
data: rumination by sex
t = 2.2841, df = 149.72, p-value = 0.02377
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.3702483 5.1180847
sample estimates:
mean in group 0 mean in group 1
31.20896 28.46479
In the \(t\)-test output we have information we can use to compute the effect size. We have the mean for each group: \(M_{(female)} = 31.21\), \(M_{(male)} = 28.46\). We already know that the difference between groups is not explained by chance alone (\(p = 0.02\)) because the p-value is smaller than 0.05.
Now let’s compute the effect size,
| sex | Mean | Var |
|---|---|---|
| 0 | 31.21 | 71.88 |
| 1 | 28.46 | 64.40 |
Effect size computation
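A sketch of this computation in R, using the means and variances from the table above (the variable names are mine):

```r
meanFemales <- 31.21; meanMales <- 28.46  ## group means from the table
varFemales  <- 71.88; varMales  <- 64.40  ## group variances from the table
## Cohen's d: mean difference divided by the root of the average variance
effectSizeRumination <- (meanFemales - meanMales) / sqrt((varFemales + varMales)/2)
round(effectSizeRumination, 2)  ## 0.33
```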
The effect size of sex on rumination is d = 0.33
We can make this calculation more fun. Let’s imagine you are a clinician, and you need a way to understand how large the differences in rumination and depression are. You need a way to quantify which variable deserves more of your attention, that is, which variable shows larger differences between cisgender males and cisgender females.
Why would you care about differences between cisgender males and cisgender females? Why would you care to compare the differences between rumination and depression?
First, we need to compute the depression score. In the rumination data set you’ll find columns named CDI; these are the items of the Children’s Depression Inventory. We will use these items to compute a total score.
We need to check whether the variances of our depression score are equal between groups before running a \(t\)-test:
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.8682 0.3526
205
## Given the Levene's Test result, we assume equality of variances.
t.test(depression ~ sex, data = rum_scores, var.equal = TRUE)
Two Sample t-test
data: depression by sex
t = 2.5468, df = 205, p-value = 0.01161
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
0.5257305 4.1298205
sample estimates:
mean in group 0 mean in group 1
11.257353 8.929577
Important
How do you know the difference is not explained by chance alone?
| sex | Mean | Var |
|---|---|---|
| 0 | 11.26 | 36.86 |
| 1 | 8.93 | 43.04 |
Effect size computation
meanFemales <- InfoByBroupDepression$Mean[1]
meanMales <- InfoByBroupDepression$Mean[2]
varianceFemales <- InfoByBroupDepression$Var[1]
varianceMales <- InfoByBroupDepression$Var[2]
effectSizeDepression <- (meanFemales - meanMales)/sqrt((varianceFemales + varianceMales)/2)
cat("The effect size of sex on depression is", "d =", round(effectSizeDepression, 2))
The effect size of sex on depression is d = 0.37
What is your conclusion as a clinician? Is the effect for rumination very different from the effect for depression?
Should we focus on an intervention that addresses depression or/and rumination differently according to sex?
Should you create a depression intervention accounting for the effect of sex? Can you ignore rumination in your intervention?
Should we focus mainly on women because the difference between groups is large?
We can also estimate effect sizes with packages already programmed for R. In this case we can use the package effectsize.
You may not have this package installed; you can install it by running install.packages("effectsize", dependencies = TRUE).
You only need to install a package once. You don’t have to install packages every time you open RStudio.
I hope you remember JAMOVI. I mentioned this software in the first class.
It is also listed in the syllabus.
This is software that runs R through an interface that does not require you to write R code. JAMOVI writes the R code for you behind the scenes, and it also runs the code for you. Pretty awesome, isn’t it?
The CLT states: For any population with mean \(\mu\) and standard deviation \(\sigma\), the distribution of sample means for sample size \(n\) will have a mean of \(\mu\) and a standard deviation of \(\frac{\sigma}{\sqrt{n}}\) and will approach a normal distribution as \(n\) approaches infinity.
The beauty of this theorem is due to:
it describes the distribution of sample means for any population, no matter what shape, mean, or standard deviation.
The distribution of sample means “approaches” a normal distribution very rapidly.
It is better to explain this theorem with a simulation, as we always do in this class.
In this example I’ll generate 1000 “datasets” from a Gaussian process (normal distribution) with M = 60, sd = 100. Each data set will contain 10 observations.
After generating 1000 data sets, I’m estimating the mean of those simulated values.
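A minimal sketch of that simulation (replicate() is one idiomatic way to do it; the object names are mine):

```r
set.seed(123)
## 1000 samples of n = 10 each from a Gaussian process with M = 60, sd = 100
sample_means <- replicate(1000, mean(rnorm(10, mean = 60, sd = 100)))
mean(sample_means)  ## close to mu = 60
sd(sample_means)    ## close to sigma/sqrt(n) = 100/sqrt(10), about 31.6
```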
Imagine you ask 10 participants the question “How many minutes do you need to take a shower?”, and then conduct the same study the next day with a different set of 10 participants, every day, until you get 1000 sets of 10 participants.
There are more beautiful qualities to the CLT. This rule applies to all distributions: as you increase the number of observations per sample, and the number of samples approaches infinity, the distribution of sample means will look Gaussian (normally distributed).
Remember the Poisson distribution I mentioned a few lectures ago?
As you can see, one data set of 10 observations generated from a Poisson process doesn’t look like a Gaussian process at all!
Can the sample of means from a Poisson process generate a pretty bell shape?
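We can check with a quick simulation (a sketch; lambda = 3 and the sample sizes are my arbitrary choices): even though each Poisson sample is skewed, the distribution of many sample means looks bell-shaped.

```r
set.seed(123)
## 2000 samples of n = 30 each from a Poisson process with lambda = 3
pois_means <- replicate(2000, mean(rpois(30, lambda = 3)))
mean(pois_means)  ## close to lambda = 3, the mean of the generating process
hist(pois_means, breaks = 40, main = "Sample means from a Poisson process")
```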
We saw that the mean of the sample means is close to the mean of the data-generating process, and the type of distribution doesn’t matter. However, take into account that there are probability distribution models that will require more than 800000000000 sampled means to approach the Gaussian shape. But eventually the distribution of all the means will look like a Gaussian distribution.
In simple terms, what does this mean? It means that the mean is an unbiased estimate: on average, it tends to be close to the “population” mean. This is why we like the mean as an estimate, and why we use it to test hypotheses.
The CLT will be helpful to introduce the concept of standard error.
The standard error is a measure of how far the mean of my sample is from the true value in the data-generating process.
In the context of multiple samples, the standard error is the standard deviation of all the sampled means.
We should go to another example:
The Standard Deviation of 1000 sampled means is 33.25718
In real life you’ll never know the standard deviation of the sample means; we only know it here because this is a simulated example.
In real applications you need to estimate how far your observed mean is from the true mean.
We can use the estimate \(\frac{\sigma}{\sqrt{n}}\) to approximate how far we are from the mean. In this formula \(\sigma\) is the standard deviation and \(n\) is the sample size.
The standard error decreases when we increase the sample size, let’s see another example:
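A tiny sketch using \(\sigma = 100\), the value from the earlier simulation (variable names are mine):

```r
sigma <- 100
se_n10  <- sigma / sqrt(10)   ## about 31.6, close to the simulated 33.26 above
se_n100 <- sigma / sqrt(100)  ## exactly 10: a larger sample gives a smaller SE
se_n10; se_n100
```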
Important
In simple words: the standard error (SE) measures how far our sample mean is from the true value in the data-generating process.